Recomputation Enabled Efficient Checkpointing

Authors

  • Ismail Akturk
  • Ulya R. Karpuzcu
Abstract

Systematic checkpointing of the machine state makes it possible to restart execution from a safe state when an error is detected. The time and energy overhead of checkpointing, however, grows with checkpointing frequency. Amortizing this overhead becomes especially challenging as expected error rates grow, because checkpointing frequency tends to increase with the error rate. Based on the observation that, due to imbalanced technology scaling, recomputing a data value can be more energy efficient than retrieving (i.e., loading) a stored copy, this paper explores how recomputing data values (which would otherwise be read from a checkpoint in memory or secondary storage) can reduce the machine state to be checkpointed, and thereby the checkpointing overhead. Specifically, the resulting amnesic checkpointing framework, AmnesiCHK, can reduce the storage overhead by up to 23.91%, the time overhead by 11.92%, and the energy overhead by 12.53%, even in a relatively small-scale system.
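The core idea in the abstract, leaving recomputable values out of the checkpoint and regenerating them on restore, can be illustrated with a toy sketch. This is not AmnesiCHK's actual mechanism, which the abstract does not detail; the names `take_checkpoint`, `restore`, and the `recipes` table are hypothetical, and the sketch assumes every omitted value is a pure function of other values in the state.

```python
from typing import Callable, Dict, Tuple

# Hypothetical "recipe": a pure function plus the names of its inputs.
Recipe = Tuple[Callable[..., int], Tuple[str, ...]]


def take_checkpoint(state: Dict[str, int], recipes: Dict[str, Recipe]) -> Dict[str, int]:
    """Store only values that have no recipe; recomputable values are omitted."""
    return {name: value for name, value in state.items() if name not in recipes}


def restore(checkpoint: Dict[str, int], recipes: Dict[str, Recipe]) -> Dict[str, int]:
    """Rebuild the full state by recomputing every omitted value."""
    state = dict(checkpoint)
    pending = dict(recipes)
    while pending:
        progressed = False
        # Iterate over a snapshot so entries can be removed from `pending`.
        for name, (fn, inputs) in list(pending.items()):
            if all(i in state for i in inputs):
                state[name] = fn(*(state[i] for i in inputs))
                del pending[name]
                progressed = True
        if not progressed:
            raise ValueError("some values cannot be recomputed from the checkpoint")
    return state


# Toy machine state: "total" and "scaled" are derived from "a" and "b".
state = {"a": 3, "b": 4, "total": 7, "scaled": 70}
recipes: Dict[str, Recipe] = {
    "total": (lambda a, b: a + b, ("a", "b")),
    "scaled": (lambda total: total * 10, ("total",)),
}

checkpoint = take_checkpoint(state, recipes)  # holds only "a" and "b"
restored = restore(checkpoint, recipes)       # recomputes "total", then "scaled"
```

In this toy case the checkpoint stores two of the four values. Whether the trade pays off in practice depends on the recomputation cost relative to the load cost, which is the trade-off the paper evaluates.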

Similar resources

Covering resilience: A recent development for binomial checkpointing

Nowadays, adjoint methods form a well-established approach to computing gradient information very efficiently in terms of runtime. However, as soon as the considered process involves any kind of nonlinearity, the memory requirement to compute the corresponding adjoints is in principle proportional to the operation count of the underlying function, see, e.g., [1, Sec. 4.6]. For this reason,...

An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment

Mobile computing systems are made up of different components, among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environments. In the suggested scheme, nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs, and as a result the workload of Mobile ...

Asynchronous Two-level Checkpointing Scheme for Large-scale Adjoints in the Spectral-element Solver Nek5000

Adjoints are an important computational tool for large-scale sensitivity evaluation, uncertainty quantification, and derivative-based optimization. An essential component of their performance is the storage/recomputation balance in which efficient checkpointing methods play a key role. We introduce a novel asynchronous two-level adjoint checkpointing scheme for multistep numerical time discreti...

Enabling user-driven Checkpointing strategies in Reverse-mode Automatic Differentiation

This paper presents a new functionality of the Automatic Differentiation (AD) tool tapenade. tapenade generates adjoint codes, which are widely used for optimization or inverse problems. Unfortunately, for large applications the adjoint code demands a great deal of memory, because it needs to store a large set of intermediate values. To cope with that problem, tapenade implements a su...

Avoiding recomputation in linkage analysis.

We describe four improvements we have implemented in a version of the genetic linkage analysis programs in the LINKAGE package: subdivision of recombination classes, better handling of loops, better coordination between the optimization and output routines, and a checkpointing facility. The unifying theme for all the improvements is to store a small amount of data to avoid expensive recomputati...

Journal title:
  • CoRR

Volume abs/1710.04685

Publication date 2017